Massive Data Analytics in the Cloud: TPC-H Experience on Hadoop Clusters
نویسنده
چکیده
NoSQL systems rose alongside internet companies, which have different challenges in dealing with data that the traditional RDBMS solutions could not cope with. Indeed, in order to handle efficiently the continuous growth of data, NoSQL technologies feature dynamic horizontal scaling rather than vertical scaling. To date few studies address On-Line Analytical Processing challenges and solutions using NoSQL systems. In this paper, we first overview NoSQL and adjacent technologies, then discuss analytics challenges in the cloud, and propose a taxonomy for a decision-support system workload, as well as specific analytics scenarios. The proposed scenarios aim at allowing best performances, best availability and tradeoff between space, bandwidth and computing overheads. Finally, we evaluate Hadoop/Pig using TPC-H benchmark, under different analytics scenarios, and report thorough performance tests on Hadoop for various data volumes, workloads, and cluster’ sizes.
منابع مشابه
Low Latency Geo-distributed Data Analytics – Public Review
Large cloud service providers ingest massive amounts of data in geographically distributed sites spread across the globe. Analytics for such planetary-scale datasets is an important emerging challenge. The current practice is to copy all data to a central location, where it can be dealt with locally by standard data analytics stacks such as Hadoop and Spark. However, transferring large volumes ...
متن کاملA Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection
Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....
متن کاملCloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming
The objective of this study is to verify the importance of the capabilities of cloud computing services in managing and analyzing big data in business organizations because the rapid development in the use of information technology in general and network technology in particular, has led to the trend of many organizations to make their applications available for use via electronic platforms hos...
متن کاملTowards a next generation of scientific computing in the Cloud
More than ever, designing new types of highly scalable data intensive computing is needed to qualify the new generation of scientific computing and analytics effectively perform complex tasks on massive amounts of data such as clustering, matrix computation, data mining, information extraction ... etc. MapReduce, put forward by Google, is a well-known model for programming commodity computer cl...
متن کاملParallelizing K-means with Hadoop/Mahout for Big Data Analytics
The rapid development of Internet and cloud computing technologies has led to explosive generation and processing of huge amounts of data. The ever increasing data volumes bring great values to societies, but in the meantime bring forward a number of challenges. Data mining techniques have been widely used in decision analysis in financial, medical, management, business and many other fields. H...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- IJWA
دوره 4 شماره
صفحات -
تاریخ انتشار 2012